{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 集成学习有两大方法：boosting 和 Bagging。本文首先介绍 Bagging 算法的原理、流程。然后介绍最典型的 Bagging 算法——随机森林的流程，并总结其特点，最后再介绍随机森林的 sklearn 实现的使用。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bagging算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bagging算法的原理"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bagging 算法也是基于弱学习器的一种集成学习方法。与 Boosting 算法不同的是，在 Bagging 算法的训练中，各个基学习器之间没有依赖关系，不需要已知上一轮的学习器然后训练新的弱学习器。\n",
    "\n",
    "Bagging 是**基于某种采样的数据集**，**并行**训练出所有弱学习器，然后根据**某种策略**对弱学习器进行组合，最终得到一个强学习器的机器学习算法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bagging 算法的**关键问题**如下：\n",
    "1. 训练的数据集是如何采样的？\n",
    "2. 弱学习器的结合策略是什么？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Q1：训练的数据集是如何采样的？\n",
    "\n",
    "> A1：采用随机放回抽样，参见模型的评估方法中的**自助法**（之前笔记有介绍）。\n",
    "\n",
    "> Q2：弱学习的结合策略是怎么样的？\n",
    "\n",
    "> A2：对于分类问题，采用投票法，具体来说就是将弱学习器中预测的类别最多的那个作为最终的预测结果。对于回归问题，采用平均法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bagging算法的流程"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**流程图：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](https://miro.medium.com/max/1050/1*_pfQ7Xf-BAwfQXtaBbNTEg.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**形式化描述：**\n",
    "\n",
    "输入：样本集$D=\\{(x_1,y_1),(x_2,y_2),...(x_m,y_m)\\}$，弱学习器算法（可以是CART、神经网络等）, 弱分类器迭代次数$T$\n",
    "\n",
    "输出：最终的强分类器f(x)\n",
    "\n",
    "流程\n",
    "1. 对于$t=1,2...,T$:\n",
    "    1. 对训练集进行第$t$次随机采样，共采集$m$次，得到包含$m$个样本的采样集$D_t$\n",
    "    2. 用采样集$D_t$训练第$t$个弱学习器$G_t(x)$\n",
    "2. 如果是分类算法预测，则$T$个弱学习器投出最多票数的类别或者类别之一为最终类别。如果是回归算法，$T$个弱学习器得到的回归结果进行算术平均得到的值为最终的模型输出。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 随机森林算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "随机森林（Random Forest-RF）是在 Bagging 基础上进行改造，主要有：\n",
    "1. 使用 CART 作为弱学习器\n",
    "2. 在 CART 选择划分点时进行了改造。不同于一般决策树中选择最优属性的最优属性值作为划分点，RF 中的 CART 在划分时随机选择一些样本特征，数量小于$n$，假设为$n_{sub}$，然后在这$n_{sub}$个样本中选择最优属性作为划分点。\n",
    "    - 如果$n_{sub}=n$，则此时 RF 的 CART 相当于普通的 CART。$n_{sub}$越小，则模型泛化性越好，但对于训练集的拟合程度会变差。**即$n_{sub}$越小，模型的方差会减小，但是偏差会增大。**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 随机森林算法的流程"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "输入：样本集$D=\\{(x_1,y_1),(x_2,y_2),...(x_m,y_m)\\}$，弱分类器迭代次数$T$。\n",
    "\n",
    "输出：输出为最终的强分类器$f(x)$\n",
    "\n",
    "流程\n",
    "1. 对于$t=1,2...,T$:\n",
    "    1. 对训练集进行第$t$次随机采样，共采集$m$次，得到包含$m$个样本的采样集$D_t$\n",
    "    2. 用采样集$D_t$训练第$t$个决策树模型$G_t(x)$，在训练决策树模型的节点的时候，在节点上所有的样本特征中选择一部分样本特征，在这些随机选择的部分样本特征中选择一个最优的特征来做决策树的左右子树划分\n",
    "2. 如果是分类算法预测，则$T$个弱学习器投出最多票数的类别或者类别之一为最终类别。如果是回归算法，$T$个弱学习器得到的回归结果进行算术平均得到的值为最终的模型输出。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 随机森林的推广\n",
    "RF 性能良好，延伸出不少使用广泛的变体。它们不光可以用于分类回归，还可以用于特征转换，异常点检测等。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extra-Trees（Extremely randomized Trees，极端随机树）\n",
    "Extra-Trees 是 RF 的变体，基本和 RF 一样，只是有以下区别：\n",
    "1. 对于数据集来说，Extra-Trees 的每棵决策树应用的是相同的全部训练样本\n",
    "2. 在选择划分特征时，Extra-Trees 比较的**极端**，它会随机的选择一个特征值来划分决策树\n",
    "\n",
    "由于随机选择了特征值的划分点位，而不是最优点，这会导致生成的决策树的规模一般会大于 RF 所生成的决策树。也就是说，模型的方差相对于 RF 进一步减少，但是偏偏差相对于 RF 进一步增大。在某些时候，Extra-Trees 的泛化能力比 RF 更好。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Totally Random Trees Embedding\n",
    "> **该算法与支持向量机的一些理论知识有联系，因为还没学过，所以暂且跳过不表。**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Isolation Forest\n",
    "> Isolation Forest（以下简称 IForest）是一种**异常点检测**的方法。它也使用了类似于 RF 的方法来检测异常点。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TForest 和 RF 的区别在于：\n",
    "1. TForest 也要进行随机采样，但是采样个数不同于 RF（RF 要求采样个数等于训练集样本个数），具体来说采样个数要远小于训练集样本个数（Why？）。因为异常点检测一般只需要的少部分样本。\n",
    "2. 对于每一个决策树的建立， IForest 随机选择一个划分特征，对划分特征随机选择一个划分阈值。\n",
    "3. IForest 一般会选择一个比较小的最大决策树深度max_depth（原因同1）\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Q：异常点如何判断？\n",
    "![](异常点的判断.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 总结\n",
    "大白话的比较以下 Boosting 和 Bagging：\n",
    "1. Boosting 像是一个初学者不断的练习，然后修改错题，逐渐成为某些问题的专家。\n",
    "2. Bagging 像是一群初学者做题，然后集思广益，讨论出一个结果，能应对更丰富的问题。\n",
    "\n",
    "然而，专家再准确，对于不熟悉的问题也束手无策。一群初学者再集思广益，也很难在某些方向做的深刻。\n",
    "\n",
    "接下来总结下随机森林的优缺点。\n",
    "\n",
    "优点：\n",
    "1. 训练可以并行，在大规模样本上有很大优势。\n",
    "2. 由于随机选择决策树的划分特征，因此提高了效率。\n",
    "3. 训练后可输出不同特征的重要性。\n",
    "4. 由于采用了放回随机采样，因此减少了的模型方差，增强了泛化能力。\n",
    "5. 相比于 Boosting 算法，实现简单。\n",
    "6. 对特征的缺失值不敏感。\n",
    "\n",
    "缺点：\n",
    "1. 由于采用了放回随机采样，因此提高了模型偏差。\n",
    "2. 在某些噪音比较大的样本集上，RF 模型容易陷入过拟合。\n",
    "3. 取值划分比较多的特征容易对 RF 的决策产生更大的影响，从而影响拟合的模型的效果。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 随机森林算法的 sklearn 实践\n",
    "在 scikit-learn 中，RF 的分类类是 RandomForestClassifier，回归类是 RandomForestRegressor。\n",
    "\n",
    "RF 的参数可以分为 Bagging 框架参数和决策树参数以及其他参数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注：分类问题中，sklearn 随机森林的输出策略是取每个分类器预测概率的平均，而不是让每个分类器对类别进行投票。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RF 的 Bagging 框架参数\n",
    "1. n_estimators: 表示最大的弱学习器的个数。n_estimators 太小，容易欠拟合，n_estimators 太大，计算量会太大，并且 n_estimators 到一定的数量后，再增大 n_estimators 获得的模型提升会很小，所以一般选择一个适中的数值。默认是100。\n",
    "2. oob_score :即是否采用袋外样本（随机抽样中未被抽样的样本）来评估模型的好坏。默认为 False。推荐设置为 True，因为袋外分数反应了一个模型拟合后的泛化能力。\n",
    "3. criterion: 表示 CART 做划分时对特征的评价标准。分类模型和回归模型的损失函数是不一样的。分类RF对应的CART分类树默认是基尼系数gini，另一个可选择的标准是信息增益。回归 RF 对应的 CART 默认是均方差mse，另一个可以选择的标准是绝对值差 mae。一般采用默认即可。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RF 决策树参数\n",
    "1. max_features：表示 RF 划分时考虑的最大特征数。可以使用很多种类型的值，默认是\"auto\",意味着划分时最多考虑$\\sqrt{N}$个特征；如果是\"log2\"意味着划分时最多考虑$log_2N$个特征；如果是\"sqrt\"意味着划分时最多考虑$\\sqrt{N}$个特征。如果是整数，表示考虑的特征数。如果是浮点数，表示考虑特征百分比，即考虑 百分比×总样本数 取整后的特征数。一般选择默认即可，如果特征数非常多，可以灵活使用刚才描述的其他取值来控制划分时考虑的最大特征数，以控制决策树的生成时间。\n",
    "    - 注：回归问题中使用 max_features = None （总是考虑所有的特征）， 分类问题使用 max_features = \"sqrt\" （随机考虑 sqrt(n_features) 特征，其中 n_features 是特征的个数）是比较好的默认值\n",
    "2. max_depth：表示决策树最大深度。默认为`None`，这时决策树在建立子树的时候不会限制子树的深度。一般来说，数据少或者特征少的时候可以不管这个值。如果模型样本量多，特征也多的情况下，推荐限制这个最大深度，具体的取值取决于数据的分布。常用的可以取值10-100之间。\n",
    "3. min_samples_split: 表示内部节点再划分所需最小样本数。这个值限制了子树继续划分的条件，如果某节点的样本数少于min_samples_split，则不会继续再尝试选择最优特征来进行划分。默认是2。如果样本量不大，不需要管这个值。如果样本量数量级非常大，推荐增大这个值。\n",
    "    - 注：max_depth = None 和 min_samples_split = 2 结合通常会有不错的效果（即生成完全的树）。但默认未必是最佳，最佳参数需要交叉验证获得。\n",
    "4. min_samples_leaf：表示叶子节点最少样本数。这个值限制了叶子节点最少的样本数，如果某叶子节点数目小于样本数，则会和兄弟节点一起被剪枝。可以输入最少的样本数的整数，或者最少样本数占样本总数的百分比，默认是1。如果样本量不大，不需要管这个值。如果样本量数量级非常大，则推荐增大这个值。\n",
    "5. min_weight_fraction_leaf：表示叶子节点最小的样本权重。这个值限制了叶子节点所有样本权重和的最小值，如果小于这个值，则会和兄弟节点一起被剪枝。默认是0，即不考虑权重问题。一般来说，如果较多样本有缺失值，或者分类树样本的分布类别偏差很大，就会引入样本权重，这时就要注意这个值。\n",
    "6. max_leaf_nodes：表示最大叶子节点数。通过限制最大叶子节点数，可以防止过拟合，默认是`None`，即不限制最大的叶子节点数。如果加了限制，算法会建立在最大叶子节点数内最优的决策树。如果特征不多，可以不考虑这个值，但是如果特征分成多的话，可以加以限制，具体的值可以通过交叉验证得到。\n",
    "7. min_impurity_split：表示节点划分最小不纯度。这个值限制了决策树的增长，如果某节点的不纯度(基于基尼系数 or 均方差)小于这个阈值，则该节点不再生成子节点。一般不推荐改动默认值1e-7。\n",
    "\n",
    "决策树参数中最重要的包括 max_features，max_depth，min_samples_split 和 min_samples_leaf。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 其他参数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. n_jobs：进行并行化。如果设置 n_jobs = k ，则计算被划分为 k 个作业，并运行在机器的 k 个核上。 如果设置 n_jobs = -1 ，则使用机器的所有核。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 实例\n",
    "这里使用 RF 的分类类 RandomForestClassifier 进行泰塔尼克乘客获救预测。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn import metrics\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "读取数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Braund, Mr. Owen Harris</td>\n",
       "      <td>male</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>A/5 21171</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
       "      <td>female</td>\n",
       "      <td>38.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>PC 17599</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>C85</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Heikkinen, Miss. Laina</td>\n",
       "      <td>female</td>\n",
       "      <td>26.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>STON/O2. 3101282</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
       "      <td>female</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>113803</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>C123</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Allen, Mr. William Henry</td>\n",
       "      <td>male</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>373450</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PassengerId  Survived  Pclass  \\\n",
       "0            1         0       3   \n",
       "1            2         1       1   \n",
       "2            3         1       3   \n",
       "3            4         1       1   \n",
       "4            5         0       3   \n",
       "\n",
       "                                                Name     Sex   Age  SibSp  \\\n",
       "0                            Braund, Mr. Owen Harris    male  22.0      1   \n",
       "1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   \n",
       "2                             Heikkinen, Miss. Laina  female  26.0      0   \n",
       "3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   \n",
       "4                           Allen, Mr. William Henry    male  35.0      0   \n",
       "\n",
       "   Parch            Ticket     Fare Cabin Embarked  \n",
       "0      0         A/5 21171   7.2500   NaN        S  \n",
       "1      0          PC 17599  71.2833   C85        C  \n",
       "2      0  STON/O2. 3101282   7.9250   NaN        S  \n",
       "3      0            113803  53.1000  C123        S  \n",
       "4      0            373450   8.0500   NaN        S  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path = r'D:\\Li\\Git_repository\\ML-by-Python\\实战\\Titanic passenger rescue forecast\\train.csv'\n",
    "train = pd.read_csv(path)\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据清洗"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "train.Age = train.Age.fillna(train.Age.median())  # 填充Age列的缺失值\n",
    "\n",
    "# 性别属性预处理 male => 0，female => 1\n",
    "train.loc[train.Sex == \"male\", \"Sex\"] = 0  \n",
    "train.loc[train.Sex == \"female\", \"Sex\"] = 1   \n",
    "\n",
    "#缺失值用最多的S进行填充\n",
    "train.Embarked = train.Embarked.fillna('S') \n",
    "\n",
    "#地点用0,1,2\n",
    "train.loc[train[\"Embarked\"] == \"S\", \"Embarked\"] = 0    \n",
    "train.loc[train[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
    "train.loc[train[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
    "\n",
    "# 转换数据类型（这里必须将object类型转为int 否则训练会报错）\n",
    "train.Sex = train.Sex.astype('int')\n",
    "train.Embarked = train.Embarked.astype('int')\n",
    "\n",
    "# 特征选择\n",
    "feature_name = [\"Pclass\", \"Sex\", \"Age\", \"SibSp\", \"Parch\", \"Fare\", \"Embarked\"]\n",
    "\n",
    "# 准备训练集和测试集\n",
    "X = train[feature_name]\n",
    "y = train.Survived"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "训练模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.8103254769921436\n",
      "AUC Score (Train): 0.997127\n"
     ]
    }
   ],
   "source": [
    "rf0 = RandomForestClassifier(oob_score=True, random_state=10)\n",
    "rf0.fit(X,y)\n",
    "\n",
    "print(rf0.oob_score_)\n",
    "y_predprob = rf0.predict_proba(X)[:,1]\n",
    "print(\"AUC Score (Train): %f\" % metrics.roc_auc_score(y, y_predprob))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对参数 n_estimator 进行网格搜索"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'n_estimators': 30}, 0.8548807421217027)"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "param_test1 = {'n_estimators':range(10,71,10)}\n",
    "gsearch1 = GridSearchCV(estimator = RandomForestClassifier(min_samples_split=100,\n",
    "                                  min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10), \n",
    "                       param_grid=param_test1, scoring='roc_auc', cv=5)\n",
    "gsearch1.fit(X, y)\n",
    "gsearch1.best_params_, gsearch1.best_score_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对参数 max_depth 和 min_samples_split 进行网格搜索"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "D:\\Anaconda\\lib\\site-packages\\sklearn\\model_selection\\_search.py:849: FutureWarning: The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.\n",
      "  \"removed in 0.24.\", FutureWarning\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "({'max_depth': 5, 'min_samples_split': 130}, 0.8585101306360018)"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "param_test2 = {'max_depth':range(3,14,2), 'min_samples_split':range(50,201,20)}\n",
    "gsearch2 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= 30, \n",
    "                                  min_samples_leaf=20,max_features='sqrt' ,oob_score=True, random_state=10), param_grid = param_test2, scoring='roc_auc',iid=False, cv=5)\n",
    "gsearch2.fit(X,y)\n",
    "gsearch2.best_params_, gsearch2.best_score_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "查看模型的袋外分数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.7946127946127947\n"
     ]
    }
   ],
   "source": [
    "rf1 = RandomForestClassifier(n_estimators= 30, max_depth=5, min_samples_split=130,\n",
    "                                  min_samples_leaf=20, max_features='sqrt', oob_score=True, random_state=10)\n",
    "rf1.fit(X,y)\n",
    "print(rf1.oob_score_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对参数 min_samples_split 和 min_samples_leaf 进行网格搜索"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "D:\\Anaconda\\lib\\site-packages\\sklearn\\model_selection\\_search.py:849: FutureWarning: The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.\n",
      "  \"removed in 0.24.\", FutureWarning\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "({'min_samples_leaf': 10, 'min_samples_split': 120}, 0.85997030062705)"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "param_test3 = {'min_samples_split':range(80,150,20), 'min_samples_leaf':range(10,60,10)}\n",
    "gsearch3 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= 30, max_depth=5,\n",
    "                                  max_features='sqrt' ,oob_score=True, random_state=10), param_grid = param_test3, scoring='roc_auc',iid=False, cv=5)\n",
    "gsearch3.fit(X,y)\n",
    "gsearch3.best_params_, gsearch3.best_score_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对参数 max_features 进行网格搜索"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'max_features': 2} 0.85997030062705\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "D:\\Anaconda\\lib\\site-packages\\sklearn\\model_selection\\_search.py:849: FutureWarning: The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.\n",
      "  \"removed in 0.24.\", FutureWarning\n"
     ]
    }
   ],
   "source": [
    "param_test4 = {'max_features': range(1,len(feature_name))}  #  max_features 的最大值不能超过 feature_name 的长度\n",
    "gsearch4 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= 30, max_depth=5, min_samples_split=120,\n",
    "                                  min_samples_leaf=10, oob_score=True, random_state=10), param_grid = param_test4, scoring='roc_auc',iid=False, cv=5)\n",
    "gsearch4.fit(X,y)\n",
    "print(gsearch4.best_params_, gsearch4.best_score_)\n",
    "best_model = gsearch4.best_estimator_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "获取到最佳模型后，再次查看袋外分数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.8013468013468014\n"
     ]
    }
   ],
   "source": [
    "best_model.fit(X, y)\n",
    "print(best_model.oob_score_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reference\n",
    "0. 《机器学习》周志华\n",
    "1. https://www.cnblogs.com/pinard/p/6156009.html\n",
    "2. https://www.cnblogs.com/pinard/p/6160412.html"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
